Foreword



This Technical Specification (TS) has been produced by ETSI Technical Committee Satellite Earth Stations and 
Systems (SES). 

The contents of the present document are subject to continuing work within TC-SES and may change following formal 
TC-SES approval. Should TC-SES modify the contents of the present document it will then be republished by ETSI 
with an identifying change of release date and an increase in version number as follows: 

Version 1.m.n

where: 

• the third digit (n) is incremented when editorial only changes have been incorporated in the specification; 

• the second digit (m) is incremented for all other types of changes, i.e. technical enhancements, corrections, 
updates, etc. 

The present document is part 6, sub-part 2 of a multi-part deliverable covering the GEO-Mobile Radio Interface 
Specifications, as identified below: 

Part 1: "General specifications";

Part 2: "Service specifications"; 

Part 3: "Network specifications"; 

Part 4: "Radio interface protocol specifications"; 

Part 5: "Radio interface physical layer specifications"; 

Part 6: "Speech coding specifications"; 

Sub-part 1: "Speech Processing Functions; GMR-1 06.001"; 

Sub-part 2: "Vocoder: Speech Transcoding; GMR-1 06.010"; 

Sub-part 3: "Vocoder: Substitution and Muting of Lost Frames; GMR-1 06.011"; 

Sub-part 4: "Vocoder: Comfort Noise Aspects; GMR-1 06.012"; 

Sub-part 5: "Vocoder: Discontinuous Transmission (DTX); GMR-1 06.031"; 

Sub-part 6: "Vocoder: Voice Activity Detection (VAD); GMR-1 06.032"; 
Part 7: "Terminal adaptor specifications". 
Introduction



GMR stands for GEO (Geostationary Earth Orbit) Mobile Radio interface, which is used for mobile satellite services 
(MSS) utilizing geostationary satellite(s). GMR is derived from the terrestrial digital cellular standard GSM and 
supports access to GSM core networks. 

Due to the differences between terrestrial and satellite channels, some modifications to the GSM standard are necessary. 
Some GSM specifications are directly applicable, whereas others are applicable with modifications. Similarly, some 
GSM specifications do not apply, while some GMR specifications have no corresponding GSM specification. 

Since GMR is derived from GSM, the organization of the GMR specifications closely follows that of GSM. The GMR 
numbers have been designed to correspond to the GSM numbering system. All GMR specifications are allocated a 
unique GMR number as follows: 

GMR-n xx.zyy 

where: 

xx.0yy (z=0) is used for GMR specifications that have a corresponding GSM specification. In this case, the
numbers xx and yy correspond to the GSM numbering scheme. 

xx.2yy (z=2) is used for GMR specifications that do not correspond to a GSM specification. In this case, only the 
number xx corresponds to the GSM numbering scheme and the number yy is allocated by GMR. 

n denotes the first (n=1) or second (n=2) family of GMR specifications.
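
As an illustration of the numbering rule, the following minimal sketch splits a GMR number into its parts. The function name `parse_gmr_number` and the keys of the returned dictionary are illustrative assumptions and are not defined by the GMR specifications.

```python
import re

# "GMR-n xx.zyy": n is the family (1 or 2), z is 0 or 2 as described above.
_GMR_NUMBER = re.compile(r"^GMR-([12]) (\d{2})\.([02])(\d{2})$")


def parse_gmr_number(text):
    """Split a GMR number into family, xx, yy and the meaning of z (hypothetical helper)."""
    match = _GMR_NUMBER.match(text)
    if match is None:
        raise ValueError(f"not a GMR number: {text!r}")
    family, xx, z, yy = match.groups()
    return {
        "family": int(family),
        "xx": xx,
        "yy": yy,
        # z=0: a corresponding GSM specification exists; z=2: GMR only.
        "has_gsm_counterpart": z == "0",
    }
```

For example, `parse_gmr_number("GMR-1 06.010")` reports family 1 with a GSM counterpart, whereas `parse_gmr_number("GMR-1 01.201")` reports a GMR-only specification.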

A GMR system is defined by the combination of a family of GMR specifications and GSM specifications as follows: 

• If a GMR specification exists it takes precedence over the corresponding GSM specification (if any). This 
precedence rule applies to any references in the corresponding GSM specifications. 

NOTE: Any references to GSM specifications within the GMR specifications are not subject to this precedence 
rule. For example, a GMR specification may contain specific references to the corresponding GSM 
specification. 

• If a GMR specification does not exist, the corresponding GSM specification may or may not apply. The 
applicability of the GSM specifications is defined in GMR-1 01.201 [7]. 
1 Scope



The present document provides a high-level algorithmic description of the vocoder [6] and integrated functions which 
include voice activity detection (VAD), tone detection, comfort noise insertion (CNI), forward error control (FEC), 
frame substitution (frame repeats), and frame muting. 



2 References



The following documents contain provisions which, through reference in this text, constitute provisions of the present 
document. 

• References are either specific (identified by date of publication and/or edition number or version number) or 
non-specific. 

• For a specific reference, subsequent revisions do not apply. 

• For a non-specific reference, the latest version applies. 

[1] GMR-1 01.004 (ETSI TS 101 376-1-1): "GEO-Mobile Radio Interface Specifications; Part 1:
General specifications; Sub-part 1: Abbreviations and acronyms; GMR-1 01.004".

[2] GMR-1 05.003 (ETSI TS 101 376-5-3): "GEO-Mobile Radio Interface Specifications; Part 5:
Radio interface physical layer specifications; Sub-part 3: Channel Coding; GMR-1 05.003".

[3] GMR-1 05.008 (ETSI TS 101 376-5-6): "GEO-Mobile Radio Interface Specifications; Part 5:
Radio interface physical layer specifications; Sub-part 6: Radio Subsystem Link Control;
GMR-1 05.008".

[4] GMR-1 06.012 (ETSI TS 101 376-6-4): "GEO-Mobile Radio Interface Specifications; Part 6:
Speech coding specifications; Sub-part 4: Vocoder: Comfort Noise Aspects; GMR-1 06.012".

[5] GMR-1 06.032 (ETSI TS 101 376-6-6): "GEO-Mobile Radio Interface Specifications; Part 6:
Speech coding specifications; Sub-part 6: Vocoder: Voice Activity Detection (VAD);
GMR-1 06.032".

[6] GMR-1 06.001 (ETSI TS 101 376-6-1): "GEO-Mobile Radio Interface Specifications; Part 6:
Speech coding specifications; Sub-part 1: Speech Processing Functions; GMR-1 06.001".

[7] GMR-1 01.201 (ETSI TS 101 376-1-2): "GEO-Mobile Radio Interface Specifications; Part 1:
General specifications; Sub-part 2: Introduction to the GMR-1 Family; GMR-1 01.201".



3 Definitions and abbreviations 

3.1 Definitions 

For the purposes of the present document, the following terms and definitions apply: 

Voice Activity Detection (VAD): method of classifying short segments of speech as either "voice" or "background
noise". The decision is based upon comparing the current level and spectral characteristics of the input signal with
a typical level and typical spectral characteristics

Comfort Noise Insertion (CNI): method of synthesizing low-level noise on the receive side during breaks in voice 
transmission. To increase the perceived voice quality, the synthesized noise has characteristics that are similar to the 
background noise present on the transmit side 

Forward Error Correction (FEC): method of introducing redundancy to binary data that allows for the detection 
and/or correction of errors introduced during transmission of that data 
V/UV (Voiced/Unvoiced): each spectral band is declared either "voiced" or "unvoiced", depending upon the amount of
periodic energy in that band. This voicing decision is frequently referred to as a V/UV decision 

frame: data representing a full 40 msec of continuous data input to or output from the vocoder. The frame data may 
consist of model parameters, quantized bits, FEC encoded channel data, or speech samples at various points in the 
vocoder 

subframe: data representing 10 msec of continuous data input to or output from the vocoder, or the result of processing 
that data through various points in the vocoder. For example, "The second subframe of model parameters is passed to 
the quantizer" is a valid use of the term as is "The decoder outputs one subframe of 8 kHz speech samples" 

subframe number: each frame is composed of four consecutive subframes that are each assigned a subframe number. 
The first, second, third, and fourth subframes within a frame are assigned subframe numbers 0, 1, 2, and 3 respectively

quantizer-frame: data representing the 20 msec of continuous vocoder data that is formed by combining subframes
0 and 1 or subframes 2 and 3

quantizer-frame number: each frame is composed of two consecutive quantizer-frames that are each assigned a
quantizer-frame number. The first and second quantizer-frames within a frame are assigned quantizer-frame numbers
0 and 1 respectively

voice frame: 40-msec frame that contains some voice data but no tone data. It may also contain comfort noise data 

SID frame: (Silence Descriptor): 20-msec frame that contains only comfort noise data. No voice or tone data may be 
present in a SID frame 

tone frame: 40-msec frame that contains tone data. It may also contain voice data or comfort noise data 

dBm0: Power in dBm referred to or measured at a zero transmission level point (0TLP)

3.2 Abbreviations 

Abbreviations used in the present document are listed in GMR-1 01.004 [1]. 



4 Algorithm overview 

The present document provides a high-level algorithmic description of the vocoder and integrated functions. 

The basic methodology of the vocoder is to divide the speech signal into overlapping speech segments (or frames) using
a window function. Each speech frame is then compared with the underlying speech model and a set of model 
parameters are estimated for that particular frame. The encoder quantizes these model parameters to produce a bit 
stream at the required data rate and frame size. Each frame is then error control coded. This bit stream is transmitted to 
the decoder, which decodes the error control codes and reconstructs the model parameters from the resulting data. The 
reconstructed model parameters are then used to generate a synthetic speech signal, which is the output of the voice 
decoder. 
5 High-level algorithm description



This clause provides a high-level algorithmic description for the vocoder [6], including the encoder, decoder, and FEC 
modules. 



5.1 Voice encoder



Figure 5.1 shows a block diagram for the encoder. 
Figure 5.1: Voice encoder block diagram with interface

5.1.1 Segmentation of the input signal

The vocoder [6] inputs/outputs 50 quantizer-frames per second. Each quantizer-frame represents a 20-msec segment of
8 kHz sampled speech. Each quantizer-frame is composed of two subframes, each 10 msec in duration. The subframes
are numbered 0 and 1, where subframe 0 represents the oldest speech segment and subframe 1 represents the most
recent speech segment. Because the subframe is 10 msec in duration and the speech is sampled at 8 kHz, the length of
each subframe is 80 samples.

A 40-msec frame is composed of two consecutive 20-msec quantizer-frames. 

The first steps in the voice encoder are to input the digitized speech waveform, high pass filter it, and divide it into 
overlapping segments using a window function. The number of samples input to the encoder is normally 80, but may 
vary by ±4 samples to account for various possible timing issues. 
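
The timing figures given in this clause follow from one another, as the short sketch below shows. The constant and function names are illustrative assumptions; only the numeric values come from the present document.

```python
SAMPLE_RATE_HZ = 8000
SUBFRAME_MS = 10
SUBFRAMES_PER_QUANTIZER_FRAME = 2
QUANTIZER_FRAMES_PER_FRAME = 2
MAX_TIMING_ADJUST = 4  # the input may vary by +/-4 samples

SAMPLES_PER_SUBFRAME = SAMPLE_RATE_HZ * SUBFRAME_MS // 1000        # 80 samples
QUANTIZER_FRAME_MS = SUBFRAME_MS * SUBFRAMES_PER_QUANTIZER_FRAME   # 20 msec
FRAME_MS = QUANTIZER_FRAME_MS * QUANTIZER_FRAMES_PER_FRAME         # 40 msec
QUANTIZER_FRAMES_PER_SECOND = 1000 // QUANTIZER_FRAME_MS           # 50 per second


def is_valid_input_length(n_samples):
    """True if the encoder input length is within 80 +/- 4 samples (hypothetical helper)."""
    return abs(n_samples - SAMPLES_PER_SUBFRAME) <= MAX_TIMING_ADJUST
```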

5.1.2 Model parameter estimation

The next step in the encoder is to estimate the model parameters for each segment beginning with the fundamental 
frequency. The fundamental frequency is estimated first using a filter bank technique. Next, the V/UV decisions in eight 
spectral bands are estimated using the same filter-bank outputs. Finally, the spectral magnitudes are estimated at each 
harmonic of the fundamental from the energy distribution of the short-term spectrum. 

For SID frames only the spectral magnitudes are estimated. 
5.1.3 Voice activity detection (VAD)

The voice encoder contains an integrated VAD algorithm, which is used to support discontinuous transmission (DTX) 
in the GMR-1 system. Refer to GMR-1 06.031 (see annex A) for further DTX information and requirements.

The multifrequency VAD system determines whether voice or silence (i.e. background noise) is present. The VAD 
operates by estimating the short-term spectral characteristics of the noise and then comparing the spectrum of the 
current frame with the estimated background noise. When the difference is small or the absolute energy of the current 
frame is small for several consecutive frames, then the VAD declares the current frame to be silence. Otherwise the 
current frame is declared voice. During silence frames, the encoder encodes the background noise characteristics and 
outputs a special "silence frame" containing this information. When this information is received at the decoder it is used 
to regenerate comfort noise, which models the actual noise seen by the encoder. When the VAD determines voice is 
present, the encoder outputs a normal voice frame, or a special tone frame, as discussed below. GMR-1 06.032 [5] 
contains further information about VAD operation. 
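
The decision rule described above can be sketched as follows. This is a simplified illustration and not the VAD algorithm of GMR-1 06.032 [5]: the class name, the per-band energy input, all threshold values, the number of consecutive frames and the noise smoothing factor are assumptions chosen for the example.

```python
import math


class SketchVad:
    """Toy VAD: compares per-band energies with a running background noise estimate."""

    def __init__(self, diff_db=3.0, energy_floor=1e-6, min_quiet_frames=3, smoothing=0.9):
        self.diff_db = diff_db                    # assumed spectral difference threshold
        self.energy_floor = energy_floor          # assumed absolute energy threshold
        self.min_quiet_frames = min_quiet_frames  # assumed "several consecutive frames"
        self.smoothing = smoothing                # assumed noise tracking factor
        self.noise = None
        self.quiet_run = 0

    def is_voice(self, band_energies):
        """Return True for a voice frame, False for a silence frame."""
        eps = 1e-12
        if self.noise is None:
            self.noise = list(band_energies)
        # Mean absolute difference (in dB) between the frame and the noise estimate.
        diff = sum(
            abs(10.0 * math.log10((e + eps) / (n + eps)))
            for e, n in zip(band_energies, self.noise)
        ) / len(band_energies)
        quiet = diff < self.diff_db or sum(band_energies) < self.energy_floor
        if quiet:
            self.quiet_run += 1
            a = self.smoothing
            # Track the background noise only while the input looks like noise.
            self.noise = [a * n + (1.0 - a) * e for n, e in zip(self.noise, band_energies)]
        else:
            self.quiet_run = 0
        return self.quiet_run < self.min_quiet_frames
```

In this sketch a frame is declared silence only after the quiet condition has held for `min_quiet_frames` consecutive frames, which mirrors the "several consecutive frames" condition in the text.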

5.1.4 Tone detection 

The encoder includes an integrated tone detector, which detects single tones, DTMF tones, or Call Progress tones in the 
input signal. When a tone is detected the encoder outputs a special "tone frame", which contains the tone amplitude and 
frequency information. When the decoder receives this information, it is able to reconstruct the specified tone. All 
16 standard DTMF tones and single tones between 150 Hz and 3 800 Hz can be detected.

5.1.5 Quantization of the model parameters

The vocoder [6] bit stream is divided into 4,0 kbps of signal information and 1,2 kbps of error control information. Each
40-msec frame is composed of 160 signal bits and 48 error correction bits. The model parameters for two successive
10-msec subframes are jointly quantized to generate a quantizer-frame. A quantizer-frame thus represents a 20-msec
speech segment and there are two quantizer-frames per 40-msec frame. Quantizing subframes 0 and 1 generates
quantizer-frame 0. Quantizing subframes 2 and 3 generates quantizer-frame 1. Each quantizer-frame contains 80 signal
bits and 24 error correction bits. The 80 bits are distributed among the various voice model parameters, silence model
parameters, or tone model parameters.
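
The bit rates and bit counts quoted above are consistent with each other, as the following arithmetic sketch confirms. The names are illustrative assumptions; the numbers are taken from the text.

```python
FRAME_MS = 40
QUANTIZER_FRAMES_PER_FRAME = 2
SIGNAL_BITS_PER_FRAME = 160
FEC_BITS_PER_FRAME = 48


def bits_per_second(bits_per_frame, frame_ms=FRAME_MS):
    """Convert a per-frame bit count into a bit rate in bit/s."""
    return bits_per_frame * 1000 // frame_ms


signal_bps = bits_per_second(SIGNAL_BITS_PER_FRAME)   # 4000 bit/s = 4,0 kbps
fec_bps = bits_per_second(FEC_BITS_PER_FRAME)         # 1200 bit/s = 1,2 kbps
signal_bits_per_qframe = SIGNAL_BITS_PER_FRAME // QUANTIZER_FRAMES_PER_FRAME  # 80
fec_bits_per_qframe = FEC_BITS_PER_FRAME // QUANTIZER_FRAMES_PER_FRAME        # 24
```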

5.1.5.1 Voice frame quantization

The 80 bits are distributed among the fundamental frequencies, the V/UV decisions, and the spectral magnitudes of the 
two subframes. 

5.1.5.2 SID frame quantization

The first 6 bits are used to indicate that the quantizer-frame is a SID frame, and the remaining 74 bits are distributed 
among the spectral magnitudes that were estimated for the two subframes. SID frame quantization meets the 
requirements specified in GMR-1 06.012 [4]. The actual transmission aspects of the SID frame on the radio link are
described in GMR-1 05.008 [3].

5.1.5.3 Tone frame quantization

The first 6 bits are used to indicate that the quantizer-frame is a tone frame. Then 8 bits are used to quantize the 
amplitude of the tone, and the remaining bits are used to quantize the frequency characteristics of the tone. 

5.1.6 Bit prioritization 

The quantized bits are arranged into two classes. Bits in the first priority class will be error-protected, whereas the bits 
in the second priority class will receive no error protection. The prioritization scheme is dependent upon the frame type, 
i.e. voice, silence, or tone. Bits that are most sensitive to bit errors are placed in the first class, and bits that are less 
sensitive to bit errors are placed in the second class. There are 48 first class bits per quantizer-frame and 32 second class 
bits per quantizer-frame. 
5.1.6.1 Frame format for voice/SID frames

The bits output by the voice encoder for voice or SID quantizer-frames are ordered as follows: 

- 9 pitch bits (most sensitive);
- 6 gain bits;
- 6 voicing bits;
- 27 spectral bits;
- 2 gain bits;
- 30 spectral bits (least sensitive).

5.1.6.2 Frame format for tone frames

The bits output by the voice encoder for tone quantizer-frames are ordered as follows:

- 6 type bits (most sensitive);
- 2 sequence bits;
- 8 amplitude bits;
- 64 frequency bits (least sensitive).
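
Both frame formats add up to the 80 bits of a quantizer-frame. The sketch below represents the two orderings as tables and splits an 80-bit sequence into named fields. The field names (for example `gain_a` and `spectral_b`) and the helper `split_fields` are illustrative assumptions. The sketch also assumes that the 48 first-class bits are simply the first 48 bits in the listed order, which the present document does not state explicitly.

```python
# (field name, width in bits), in the order listed above; names are hypothetical.
VOICE_SID_LAYOUT = [
    ("pitch", 9), ("gain_a", 6), ("voicing", 6),
    ("spectral_a", 27), ("gain_b", 2), ("spectral_b", 30),
]
TONE_LAYOUT = [("type", 6), ("sequence", 2), ("amplitude", 8), ("frequency", 64)]

FIRST_CLASS_BITS = 48   # error-protected
SECOND_CLASS_BITS = 32  # unprotected


def split_fields(bits, layout):
    """Slice a quantizer-frame bit sequence into the fields of the given layout."""
    total = sum(width for _, width in layout)
    if len(bits) != total:
        raise ValueError(f"expected {total} bits, got {len(bits)}")
    fields, pos = {}, 0
    for name, width in layout:
        fields[name] = bits[pos:pos + width]
        pos += width
    return fields
```

Under the stated assumption, the first four voice/SID fields (9 + 6 + 6 + 27 = 48 bits) would coincide with the first class and the last two fields (2 + 30 = 32 bits) with the second class.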

5.2 FEC encoder 

The vocoder [6] contains an integrated convolutional encoder used for error-control coding the speech, SID, and tone 
data. The same algorithm is used regardless of the frame type. A high-level description is provided here. References 
GMR-1 05.003 [2] and vocoder [6] give further details. 

The 80 quantized and prioritized bits, output from the voice encoder, are passed to the FEC encoder that outputs 104 
error-protected bits. 

5.2.1 Convolutional encoder 

The convolutional encoder inputs 48 bits and outputs 72 bits. The 48 first-class bits output by the bit prioritizer are input 
to a rate 2/3 circular 64-state (k = 7) convolutional encoder that adds 24 error-correction bits. The convolutional 
encoder, being circular, does not utilize flush bits. Instead, the first 6 input bits are re-input at the end of the 
convolutional code. Thus, the trellis state is identical on 6th- and 54th-symbol intervals, which makes the trellis circular. 
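
The circular (tail-biting) principle can be illustrated with the sketch below. It is not the encoder of GMR-1 05.003 [2]: the generator polynomials (171 and 133 octal), the realization of rate 2/3 by puncturing a rate 1/2 code, and the puncturing pattern are all assumptions made for the example. Only the sizes (48 bits in, 72 bits out, 64 states, k = 7) and the re-input of the first 6 bits come from the text.

```python
G1, G2 = 0o171, 0o133   # assumed generator polynomials, not taken from the specification
STATE_MASK = 0x3F       # 6 memory bits, i.e. 64 states (k = 7)


def _parity(value):
    return bin(value).count("1") & 1


def encode_circular(bits):
    """Encode 48 first-class bits into 72 bits without flush bits (illustrative sketch)."""
    if len(bits) != 48:
        raise ValueError("expected 48 input bits")
    # Symbol intervals 1 to 6: load the first 6 bits to set the starting state.
    state = 0
    for b in bits[:6]:
        state = ((state << 1) | b) & STATE_MASK
    start_state = state
    out = []
    # Symbol intervals 7 to 54: remaining 42 bits, then the first 6 bits re-input.
    for i, b in enumerate(bits[6:] + bits[:6]):
        reg = (state << 1) | b          # 7-bit register, newest bit in the LSB
        out.append(_parity(reg & G1))
        if i % 2 == 0:                  # assumed puncturing: drop every other G2 bit
            out.append(_parity(reg & G2))
        state = reg & STATE_MASK
    assert state == start_state         # the trellis closes on itself
    return out
```

The final assertion checks the property stated above: because the last 6 bits fed to the encoder are the same as the first 6, the state after the 54th symbol interval equals the state after the 6th.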

5.2.2 Interleaver 

The 72 bits output by the convolutional encoder are combined with the 32 second-class bits output by the bit prioritizer.
The bits are rearranged in order to disperse burst errors. The interleaver outputs 104 bits. 
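
The present document does not give the interleaving permutation. As an illustration of how rearranging bits disperses burst errors, the sketch below uses a generic 8 x 13 block interleaver (written row by row, read column by column). The geometry and the function names are assumptions and do not represent the permutation actually used by the vocoder.

```python
ROWS, COLS = 8, 13  # assumed geometry; 8 * 13 = 104 bits
BLOCK_BITS = ROWS * COLS

# PERM[i] is the input position of the bit transmitted at position i.
PERM = [r * COLS + c for c in range(COLS) for r in range(ROWS)]


def interleave(bits):
    """Reorder 104 bits (72 coded bits followed by 32 second-class bits)."""
    if len(bits) != BLOCK_BITS:
        raise ValueError("expected 104 bits")
    return [bits[p] for p in PERM]


def deinterleave(bits):
    """Inverse of interleave()."""
    if len(bits) != BLOCK_BITS:
        raise ValueError("expected 104 bits")
    out = [0] * BLOCK_BITS
    for i, p in enumerate(PERM):
        out[p] = bits[i]
    return out
```

With this geometry, two bits that are adjacent on the channel within a column were 13 positions apart before interleaving, so a short burst of channel errors is spread over the block after deinterleaving.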

5.3 FEC decoder 

The vocoder [6] contains an integrated Viterbi decoder, which is used for error-control decoding the speech, SID, and 
tone data. The same algorithm is used regardless of the frame type. A high-level description is provided here. 
References GMR-1 05.003 [2] and the vocoder [6] give further details. 

The FEC decoder inputs 104 error-protected bits and outputs 80 error-corrected bits to the voice decoder. 

5.3.1 Deinterleaver 

The deinterleaver inputs 104 bits and rearranges them for input to the Viterbi decoder. The unprotected second class bits 
are separated. The deinterleaver outputs 72 error-protected bits and 32 unprotected second class bits. 
5.3.2 Viterbi decoder

The 72 error-protected bits are passed into a Viterbi decoder that outputs 48 error-corrected first-class bits. Because the
convolutional encoder used a circular encoding method, the final state (the state at the 54th symbol interval) is not fixed.
This situation requires that the Viterbi decoder perform a trellis expansion over a total of 90 symbol intervals even
though there are only 48 input symbols.

5.4 Voice decoder 

The voice decoder operates by inverting the steps of the voice encoder. Figure 5.2 shows a block diagram for the 
decoder. 
Figure 5.2: Voice decoder block diagram with interface



5.4.1 Frame classifier



The decoder first determines which frame type was received, i.e. voice, SID, or tone. The model parameters will be 
reconstructed based upon this classification and the appropriate model. The decoder classifies the frame based upon the 
first six bits in the received data. 
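
The classification step can be sketched as below. The 6-bit patterns that identify SID and tone frames are not given in the present document, so the two patterns in this example are placeholders, and the function name is likewise an assumption. The sketch treats every other 6-bit value as a voice frame.

```python
# Placeholder identifiers: the real 6-bit patterns are NOT given in the present document.
SID_PATTERN = (1, 1, 1, 1, 1, 0)
TONE_PATTERN = (1, 1, 1, 1, 1, 1)


def classify_frame(bits):
    """Return 'sid', 'tone' or 'voice' from the first six bits of an 80-bit quantizer-frame."""
    if len(bits) != 80:
        raise ValueError("expected 80 bits")
    head = tuple(bits[:6])
    if head == SID_PATTERN:
        return "sid"
    if head == TONE_PATTERN:
        return "tone"
    return "voice"
```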

5.4.2 Reversal of bit prioritization 

After a frame is classified as a voice frame, a SID frame, or a tone frame, the bits are grouped together to reform the 
quantized model parameters for the appropriate model. This reforming reverses the prioritization that was necessary to 
ensure that the most sensitive bits were error-protected. 

5.4.3 Reconstruction of the model parameters 

The decoder next employs an inverse quantizer in order to reconstruct the model parameters from the quantizer frame. 
The model parameters and inverse quantizer used to perform this operation are dependent on the frame type. The 
resulting reconstructed model parameters will represent two subframes (each 10 msec) of speech. 



5.4.3.1 Voice frame reconstruction



The decoder reconstructs the fundamental frequency, voicing decisions, and spectral magnitudes for two subframes 
from the quantized model parameters by inverting the steps performed by the encoder. 
5.4.3.2 SID frame reconstruction

The decoder reconstructs the spectral magnitudes for two subframes from the quantized model parameters. The voicing 
decisions are all set to unvoiced and the fundamental frequency is set to a fixed value. The SID frame reconstruction 
meets the requirements specified in GMR-1 06.012 [4]. 

5.4.3.3 Tone frame reconstruction 

The decoder reconstructs the tone amplitude and frequency characteristics from the quantized model parameters. Then 
the tone model parameters are mapped to equivalent voice model parameters so that tone frames and voice frames can 
be synthesized using the same synthesizer. 

5.4.4 Smoothing and enhancement of model parameters 

After the model parameters have been reconstructed, they are smoothed and enhanced in order to improve the perceived
quality of the synthesized speech and reduce the impact of bit errors.

5.4.5 Synthesis of the output signal 

Once the model parameters have been generated, they are used by the decoder to synthesize a single subframe of 
speech. Speech is synthesized as the sum of a voiced component and an unvoiced component. The voiced component 
represents the speech in the frequency bands declared voiced. It is computed using a bank of harmonic oscillators where 
the amplitude, phase and frequency of each oscillator are adjusted to meet a set of constraints imposed by the model 
parameters at successive frames. Similarly, the unvoiced component represents the speech in the frequency bands 
declared unvoiced. The unvoiced component is computed in the frequency domain and then combined with the 
contribution from previous segments in the time domain using the weighted overlap-add method. The decoder outputs 
the sum of the voiced and unvoiced components. The subframe length is allowed to vary by ±4 samples. 
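
The voiced component can be pictured as a sum of sinusoids at the harmonics of the fundamental frequency. The sketch below is a strong simplification: it holds the amplitude, phase and frequency of each oscillator constant over the subframe, assumes that the band voicing decisions have already been mapped to one flag per harmonic, and omits the unvoiced component and the overlap-add step. The function and parameter names are assumptions.

```python
import math


def synthesize_voiced(f0_hz, magnitudes, voiced, n_samples=80, fs_hz=8000, phases=None):
    """Sum harmonic oscillators for the harmonics flagged as voiced (illustrative only)."""
    if phases is None:
        phases = [0.0] * len(magnitudes)
    w0 = 2.0 * math.pi * f0_hz / fs_hz  # fundamental in radians per sample
    out = []
    for n in range(n_samples):
        sample = 0.0
        # Harmonic index k starts at 1 (the fundamental itself).
        for k, (amp, is_voiced, phase) in enumerate(zip(magnitudes, voiced, phases), start=1):
            if is_voiced:
                sample += amp * math.cos(k * w0 * n + phase)
        out.append(sample)
    return out
```

The actual decoder adjusts each oscillator between successive frames to satisfy the constraints imposed by the model parameters, which this sketch does not attempt.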



6 Tone performance 

This clause provides specifications for the tone synthesis, transmission, and regeneration algorithms.

Input Dynamic Range 40 dB 

An input signal will not be rejected as a valid tone if the amplitude of both frequency components is within the specified 
limit of the amplitude of the maximum sinusoid that can be input without saturation (defined as +3 dBm0).

Input Signal-to-Noise Ratio 15 dB (DTMF) 

30 dB (single tone < 500 Hz) 

20 dB (single tone > 500 Hz) 

In order for an input signal to correspond to a valid tone, the ratio of in-band to out-of-band energy must be greater than 
the specified limit. In-band energy is defined as energy in frequency components within ±3,5% of the two frequencies.
Out-of-band energy is defined as the total energy minus the in-band energy. 

Minimum DTMF Frequency Tolerance ±1% 

An input signal will be accepted as a DTMF tone if both of its principal frequency components are within the specified 
limit from the ideal frequencies. 

Maximum DTMF Frequency Tolerance ±3,5% 

An input signal will be rejected as a DTMF tone if either of its principal frequency components is outside the specified
limit of the ideal frequencies. 
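
Taken together, the minimum and maximum frequency tolerances define three regions: components within ±1% of the ideal frequencies are accepted, components beyond ±3,5% are rejected, and the behaviour in between is left to the implementation. The sketch below expresses this using the standard DTMF frequencies. The function names and the "either" label for the intermediate region are assumptions.

```python
DTMF_LOW_HZ = (697.0, 770.0, 852.0, 941.0)
DTMF_HIGH_HZ = (1209.0, 1336.0, 1477.0, 1633.0)
MUST_ACCEPT = 0.01    # minimum DTMF frequency tolerance (1%)
MUST_REJECT = 0.035   # maximum DTMF frequency tolerance (3,5%)


def classify_component(freq_hz, nominal_set_hz):
    """Classify one frequency component against the nearest ideal frequency."""
    nearest = min(nominal_set_hz, key=lambda f: abs(f - freq_hz))
    deviation = abs(freq_hz - nearest) / nearest
    if deviation <= MUST_ACCEPT:
        return "accept"
    if deviation > MUST_REJECT:
        return "reject"
    return "either"


def classify_pair(low_hz, high_hz):
    """Combine both components: reject if either is out, accept only if both are in."""
    results = (
        classify_component(low_hz, DTMF_LOW_HZ),
        classify_component(high_hz, DTMF_HIGH_HZ),
    )
    if "reject" in results:
        return "reject"
    if results == ("accept", "accept"):
        return "accept"
    return "either"
```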
Normal DTMF Twist 6 dB

An input signal will not be rejected as a DTMF tone if the energy contained within the low-frequency band is greater 
than the energy contained within the high-frequency band by an amount less than the specified limit. Each low and high 
frequency band is limited to ±3,5% of the nominal center frequencies.

Reverse DTMF Twist 6 dB 

An input signal will not be rejected as a valid DTMF tone if the energy contained within the high-frequency band is 
greater than the energy contained within the low-frequency band by an amount less than the specified limit. Each low 
and high frequency band is limited to ±3,5% of the nominal center frequencies.
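
Both twist requirements can be checked with a single energy ratio expressed in decibels. In the sketch below a positive value corresponds to normal twist as defined above (low-frequency band stronger) and a negative value to reverse twist. The function names are assumptions, and the band energies are taken as given inputs rather than measured here.

```python
import math

TWIST_LIMIT_DB = 6.0  # applies to both normal and reverse twist


def twist_db(low_band_energy, high_band_energy):
    """Energy ratio in dB: positive for normal twist, negative for reverse twist."""
    return 10.0 * math.log10(low_band_energy / high_band_energy)


def twist_within_limit(low_band_energy, high_band_energy):
    """True if the twist magnitude is less than the 6 dB limit in either direction."""
    return abs(twist_db(low_band_energy, high_band_energy)) < TWIST_LIMIT_DB
```

For example, an energy ratio of 2:1 is about 3 dB of twist and is within the limit, whereas a ratio of 4:1 is slightly more than 6 dB and falls outside it.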

DTMF Amplitude Accuracy ±2 dB 

For any input that corresponds to a valid DTMF code, the amplitude of each frequency component must be estimated to 
an accuracy of ±2 dB. 

Minimum Tone Duration 40 msec 

An input signal may be rejected as a tone if its time duration is less than the specified limit. The points at which the 
envelope is 20 dB below its peak value define the duration of the tone. 

Tone Onset/Offset Accuracy 40 msec 

The difference between the duration of a valid tone and the length of time the appropriate control flag is set will be less 
than the specified limit, where the duration of the tone is defined by the points at which the envelope is 20 dB below its 
peak value. 

Maximum Voice/Tone Transition 20 msec 

On the commencement or termination of a valid tone, no more than the specified limit of tone will be encoded and 
transmitted in voice mode. 

Voice/Tone Alignment ±5 msec 

The encoder delay in tone mode will be within the specified limit of the encoder delay in voice mode. 

Maximum DTMF False Alarm Rate 1 × 10⁻⁴

For any non-DTMF (i.e. voice or other signal) input, the fractional number of frames detected as DTMF will not exceed 
the specified rate over any one-hour period. 

Minimum DTMF Recognition Rate 99,99% 

DTMF performance will be such that the specified correct digit recognition rate is achieved for all input signals meeting 
all other required DTMF specifications (SNR, twist, frequency, duration, etc.). 

Single Tone Frequency Accuracy ±32 Hz 

The center frequency of a valid single tone will be estimated to within the specified limit. 

Minimum Reconstructed DTMF Tone/Pause Duration 40 msec 

The reconstructed duration of a DTMF tone (i.e. in tone mode) or pause (i.e. not in tone mode) at the output of the 
decoder will not be less than the specified minimum in clear channel conditions. 

Minimum Reconstructed DTMF Parameters Twist ±1 dB 

Frequency Offset ±0,25% 

SNR 35 dB 

The characteristics of a reconstructed tone must be within the specified limits from the corresponding ideal tone. 
7 Delay specifications



The algorithmic delay of the vocoder [6] is 62 msec. This algorithmic delay is divided into an approximately 58 msec 
delay for the encoder and a 4 msec delay for the decoder. The algorithmic delay of the encoder includes 20 msec of 
buffering delay. 

The estimated processing delay is dependent on the processor used. Processor scheduling (i.e. time sharing a single 
processor), transmission and other system delays may introduce additional delay. 
Annex A (informative):
Bibliography 



GMR-1 06.031 (ETSI TS 101 376-6-5): "GEO-Mobile Radio Interface Specifications; Part 6: Speech coding 
specifications; Sub-part 5: Vocoder: Discontinuous Transmission (DTX); GMR-1 06.031". 